Closed Bug 709193 Opened 13 years ago Closed 13 years ago

Win PGO builds hitting 3GB virtual address space limit, failing with: "nswindowmediator.cpp(821) : fatal error C1001: An internal error has occurred in the compiler. LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage"

Categories

(Firefox Build System :: General, defect)

x86
Windows Server 2003
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: khuey)

References

Details

(Whiteboard: [see comment 36 & comment 62])

Twice I've hounded patches that seemed to be the cause out of the tree, only to have them demonstrate that either another build while they were still in, or a try build, showed that they were not the cause, and then reland.

I don't have any idea what is the cause, or how to determine it, but inbound is currently totally hosed and unable to merge as a result.

https://tbpl.mozilla.org/php/getParsedLog.php?id=7856152&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=7856634&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=7853995&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=7856443&tree=Mozilla-Inbound
we believe the cause may be
59c363453713
Ehsan Akhgari – Bug 703444. Port SPS profiler to Windows. r=jmuizelaar

Waiting for some confirmation then someone will backout.
though the fact UX has the same error, but doesn't have that patch, makes things foggy.
There's an even earlier occurence on UX, from the Wednesday December 7 nightly:
https://tbpl.mozilla.org/php/getParsedLog.php?id=7797830&tree=UX
Khuey suggested taking a look at bug 543034. That's interesting since the symptoms are similar here:
- we have a failure across different trees
- different patches seem to be able to cause the same failure
- we have a green in the middle of a lot of reds, and that's a backout
- the push immediately after the green adds code, and it's red

So it's possible we are on a boundary, and any new code causes red, while the backout went green because it actually removed code, bringing us below the limit.
Seems likely.

If the Windows builders are updated to Win64, then we can probably get 4GB of address space for the linker instead of 3GB. That might fix this, for a while.
BTW you could test this theory by disabling some random component (e.g., --disable-webm) and see if it fixes the build bustage.
Our current theory is that we're running out of virtual address space in the linker (again :-/).

We're already operating the PGO builds in a configuration that gives them 3 GB of virtual address space.  If this is what's going wrong our only options are to split libxul apart or to move to doing 32 builds on 64 bit windows builders.
Component: Build Config → Release Engineering
Product: Firefox → mozilla.org
QA Contact: build.config → release
Version: Trunk → other
Watching this to keep on top of s-c merges.
Nothing has changed on our windows build machines in the last day and mozilla-central is still green.  This looks like its a code issue with something on mozilla-inbound.
Component: Release Engineering → Build Config
Product: mozilla.org → Firefox
QA Contact: release → build.config
Version: other → unspecified
Oddly, today we have a green in the midst of reds that was NOT a backout.
(In reply to Bill Gianopoulos from comment #11)
> Oddly, today we have a green in the midst of reds that was NOT a backout.

yes PGO can cause any sort of randomness.

btw, this happens on UX branch at least from 7th, and that code is already in central.
Or better, UX is based on the same code that is green in central
To hopefully confirm the theory in comment 5, I've pushed the following to try:

* inbound tip + MOZ_PGO=1 (to confirm it does fail as a clobber on try):
  https://tbpl.mozilla.org/?tree=Try&rev=e2ca2498c427

* inbound tip + MOZ_PGO=1 + disable accessibility/webm:
  https://tbpl.mozilla.org/?tree=Try&rev=3c0d1907b02e

* inbound tip + MOZ_PGO=1 + msvc2010:
  https://tbpl.mozilla.org/?tree=Try&rev=e8c57013b187

* inbound tip + MOZ_PGO=1 but with ehsan's SPS patch backed out (bug 703444):
  https://tbpl.mozilla.org/?tree=Try&rev=bde21781ca39

* m-c tip + MOZ_PGO=1 + ehsan's SPS patch applied:
  https://tbpl.mozilla.org/?tree=Try&rev=d62a63cb667e
Interesting results.
 
> * inbound tip + MOZ_PGO=1 (to confirm it does fail as a clobber on try):
>   https://tbpl.mozilla.org/?tree=Try&rev=e2ca2498c427

2 greens out of 6 builds.

> * inbound tip + MOZ_PGO=1 + disable accessibility/webm:
>   https://tbpl.mozilla.org/?tree=Try&rev=3c0d1907b02e

All green
 
> * inbound tip + MOZ_PGO=1 + msvc2010:
>   https://tbpl.mozilla.org/?tree=Try&rev=e8c57013b187

All green
 
> * inbound tip + MOZ_PGO=1 but with ehsan's SPS patch backed out (bug 703444):
>   https://tbpl.mozilla.org/?tree=Try&rev=bde21781ca39

One green out of 6 builds

> * m-c tip + MOZ_PGO=1 + ehsan's SPS patch applied:
>   https://tbpl.mozilla.org/?tree=Try&rev=d62a63cb667e

4 greens, 2 still running.

This seems to indicate two things:
- code size is a factor, but randomness makes it work sometimes.
- MSVC2010 uses less memory.
(In reply to Mike Hommey [:glandium] from comment #15)
> > * m-c tip + MOZ_PGO=1 + ehsan's SPS patch applied:
> >   https://tbpl.mozilla.org/?tree=Try&rev=d62a63cb667e
> 
> 4 greens, 2 still running.

Err, 4 reds (and now one green).
Two other strategies:

- Building intermediate libraries (in case the linker behaves better). Note this may fail to build completely for other reasons, but that should be fixable.
https://tbpl.mozilla.org/?tree=Try&rev=ed91f23b4488

- Building without SPDY (backout of all changesets listed by hg log -k SPDY)
https://tbpl.mozilla.org/?tree=Try&rev=60c146c38886
So, question for RelEng, can we do PGO builds on Win64 boxes per the suggestion on comment 6? Would not be the final fix but would give us some breath with 1 additional GB of virtual space
Going from 2GB to 3GB bought us nearly two years. Hopefully 3GB to 4GB would buy us another couple of years. VS2010 helps too apparently so that might give us more. Maybe by then we'll be more focused on 64bit builds and can disable PGO on 32-bit...
(In reply to Mike Hommey [:glandium] from comment #17)
> Two other strategies:
> 
> - Building intermediate libraries (in case the linker behaves better). Note
> this may fail to build completely for other reasons, but that should be
> fixable.
> https://tbpl.mozilla.org/?tree=Try&rev=ed91f23b4488

I distinctly remember being up till 3 am one time getting rid of intermediate libraries so we could fix an address space exhaustion problem.

> - Building without SPDY (backout of all changesets listed by hg log -k SPDY)
> https://tbpl.mozilla.org/?tree=Try&rev=60c146c38886

This is basically "if we get rid of a big chunk of code do things start working again" test, right?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> Going from 2GB to 3GB bought us nearly two years. Hopefully 3GB to 4GB would
> buy us another couple of years. VS2010 helps too apparently so that might
> give us more. Maybe by then we'll be more focused on 64bit builds and can
> disable PGO on 32-bit...

Unfortunately, this takes time to put in place. I don't think we can wait that long. Which brings us to ...

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #20)
> > - Building without SPDY (backout of all changesets listed by hg log -k SPDY)
> > https://tbpl.mozilla.org/?tree=Try&rev=60c146c38886
> 
> This is basically "if we get rid of a big chunk of code do things start
> working again" test, right?

If we can buy a few days/weeks by disabling/removing big chunks, I think we should go for it now, until we can do something better.
The other option would be "Disable PGO for Windows builds" until MSVC 2010 is ready, right?
(In reply to Mike Hommey [:glandium] from comment #22)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #19)
> > Going from 2GB to 3GB bought us nearly two years. Hopefully 3GB to 4GB would
> > buy us another couple of years. VS2010 helps too apparently so that might
> > give us more. Maybe by then we'll be more focused on 64bit builds and can
> > disable PGO on 32-bit...
> 
> Unfortunately, this takes time to put in place. I don't think we can wait
> that long. Which brings us to ...

Agreed.

> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #20)
> > > - Building without SPDY (backout of all changesets listed by hg log -k SPDY)
> > > https://tbpl.mozilla.org/?tree=Try&rev=60c146c38886
> > 
> > This is basically "if we get rid of a big chunk of code do things start
> > working again" test, right?
> 
> If we can buy a few days/weeks by disabling/removing big chunks, I think we
> should go for it now, until we can do something better.

Agreed.

(In reply to Emanuel Hoogeveen from comment #23)
> The other option would be "Disable PGO for Windows builds" until MSVC 2010
> is ready, right?

Right, the problem is that then we won't be building/testing the configuration we ship to 90+% of our users.
Oh, I see. Yes, lack of testing would be bad (unless it was only for a few days and nowhere near the split to aurora/beta).
Shortly after the bustage started, graphite landed too, (72 files changed, 16165 insertions(+), 0 deletions(-)), which would explain why backing out Ehsan's SPS patch from tip on it's own didn't work.

As such I suggest we:

1) Stop building graphite on windows for now (ie: https://hg.mozilla.org/try/rev/8a247ce99d61).
2) Back out Ehsan's windows-only SPS landing (https://hg.mozilla.org/integration/mozilla-inbound/rev/59c363453713).
3) Land the remove libreg patch that has been sitting around for a while (bug 679352).

I've pushed a try run with #1 and #2:
https://tbpl.mozilla.org/?tree=Try&rev=1cb485c4f804
...and #3 is pending an unbitrot, which is underway.

Since the above will only stop the problem momentarily, we need to:
  a) Request that no large libxul landings occur for the rest of this cycle (suboptimal, but our only other option is holding the trees closed for 10 days).
  b) Run PGO builds more often than 4 hours (bug 709192 or maybe a compromise between the two extremes), so we can spot offending landings more quickly.

This will at least allow the trees to reopen and fixes to land, even if large new components can't.

In parallel with the above:
* Have a concerted effort to look for more unused code quick wins (bug 457262 and friends).
* Get the ball rolling on doing 32 builds on 64 bit windows builders (let's file this as another bug).
* Once Firefox 9 is released (now only 10 days) look at the service pack stats that come in for the ~week thereafter, and hope they are able to convince the product team that bug 563318 is viable for Firefox 12.
Oh and also:
* Experiment with splitting libxul apart (comment 8).

CC'ing Ehsan and Jonathan, since comment 26 involves backing out/preffing off their landings.
Summary: Inbound keeps getting "nswindowmediator.cpp(821) : fatal error C1001: An internal error has occurred in the compiler. LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage" in Windows PGO builds → Win PGO builds hitting 3GB virtual address space limit, failing with: "nswindowmediator.cpp(821) : fatal error C1001: An internal error has occurred in the compiler. LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage"
Version: unspecified → Trunk
Depends on: PGOSilverBullet
Ehsan SPS patch isn't particularly big. Is it really worth backing out?
Agreed, at this point glandium and myself are looking at backing out SPDY and preffing Graphite off (awaiting try results).
Blocks: 609976
Just to keep anyone following along updated:

Preffing graphite off (in addition to backing out Ehsan's patch), wasn't sufficient:
https://tbpl.mozilla.org/?tree=Try&rev=1cb485c4f804

Results of backing out SPDY will be available in <90mins (or earlier, given that the failing builds end ~70mins quicker than those that succeed), tbpl link in comment 17.
FWIW, see bug 563318 comment 20, about msvc 2010.
Also as an FYI, I do my personal builds on Win7 x64 using MSVC2010 (32bit). Building m-c tip, my second pass link.exe Peak Working Set got up to about 1.92GB for xul.dll.

I don't have a fully standard build config, though. Relevant differences are:
--enable-optimize="-O2 -GL -arch:SSE2"
--disable-mochitest
--disable-tests
--disable-accessibility
--disable-crashreporter
--disable-installer
--disable-safe-browsing
--disable-updater

I would assume that the --enable-optimize if anything might increase memory usage. The rest of the disabled wouldn't have much effect on libxul linking since all of the above (except safe browsing) are linked separate from it anyway, yes? Also, is there a better measurement I should be using for address space?
Ok so the SPDY backout got 5/6 greens, so helps but needs some of the rest backed out too.

I'm going to (on inbound):
* Backout SPDY
* Stop building Graphite (pending green: https://tbpl.mozilla.org/?tree=Try&rev=14d52fea5abb)
* Land bug 679352 (pending green: https://tbpl.mozilla.org/?tree=Try&rev=e1b435a80f0d)
* Retigger a load of PGO on inbound

...and see what state we're in after that. Unless anyone has any objections?
Blocks: 707173
Blocks: 708305
Blocks: 707662
Blocks: 706236
Blocks: SPDY
Blocks: 631479
To further confirm the theory, I did what the compiler was suggesting to do about disabling optimization on the function where it is crashing. I pushed that change with the inbound tip before the backouts to try: https://tbpl.mozilla.org/?tree=Try&rev=c89cf617aa44

Sure enough, it crashed somewhere else, but this time with the more descriptive error "fatal error C1002: compiler out of heap space on pass 2", the same as bug 543034
Summary, since I've been asked by people who've been backed out / we'll need it for the coming days to point at from TBPL:

(Please note I'm not a build peer - I'm speaking with my inbound sheriff hat on; khuey/ted please correct me if I've misunderstood anything).

* Win PGO builds were failing whilst linking, due to running out of virtual address space. This happened a few years ago (bug 543034), and the fix then was to start using the /3GB switch, which upped the limit from 2GB to 3GB. However that's now still not enough.

* As a temporary measure (and as all the other options are medium-long term), a couple of large landings within the last week were backed out, to (a) confirm what the try runs were telling us, and (b) so that the trees could at least be reopened for everything but large libxul landings. After trying a few combinations of backouts on try, we settled on** preffing off Graphite & backing out SPDY (as there is no easy way to disable it from building).

* An already r+'d patch to remove libreg (bug 679352) was landed, to reduce the pressure further.

* This has turned Win PGO green again (https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=f5578fdc50ef). However, this only buys us a bit of extra time, and there still cannot be any large libxul landings until this is resolved :-((

* Our options:

  (a) Bug 457262: Kill off more unused code (still only stop-gap, but every little helps until the rest can be done). Ms2ger has spent today going through finding things that can be removed and has filed a load more bugs \o/. The more people we can have doing this, the better. Also, maybe now is a good time to kill RDF (bug 559505 and friends)?

  (b) Start using 64 bit builders for 32 bit windows builds, so the linker can use the full 4GB virtual address space. Non-trivial for releng, they can probably explain the hurdles/timeframe better than I. Have filed bug 709480.

  (c) Bug 563318: Switching to MSVC2010. Seems to use less memory during linking, so avoids the failures for now (comment 15), but no idea for how long without also doing (b). This is also blocked on Firefox 9 being released, since it's the first version that provides the service pack metrics, which the product-team require before making a decision (MSVC2010 will EOL WinXP SP1 and earlier). 

  (d) Experiment with splitting libxul apart (a la comment 8).

  (e) Turning off PGO for Windows builds. I'm presuming this would be too much of a perf loss, and so not viable, but it's been raised on IRC as an alternative to backing out people's work, so including it here for completeness.

** This is obviously really frustrating for the people who've been preffed off/backed out (sorry Patrick & Jonathan :-( ) - and it will obviously be the product team's call to decide what remains backed out. An extra apology to Patrick who got missed off the CC list additions in comment 27, and so wasn't aware of this bug until the backout comment on his SPDY bug.

Anyway, now to get some sleep finally :-)
Whiteboard: [see comment 36]
(In reply to Ed Morley [:edmorley] from comment #36)
> Summary, since I've been asked by people who've been backed out / we'll need
> it for the coming days to point at from TBPL:

Many thanks to Ed and all who've been working to figure out the issue here.... how very frustrating!

One comment:
> * This has turned Win PGO green again
> (https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=f5578fdc50ef). However,
> this only buys us a bit of extra time, and there still cannot be any large
> libxul landings until this is resolved :-((
It's not clear what would constitute a "large libxul landing", as opposed to a "medium-sized" or "small" one; but in any case, AFAIK we can't tell exactly how close we are to hitting this problem, and so we don't know how large of a landing would break the build. And it'll break just as surely after a series of tiny landings that add up over a period of days or weeks. A harmless-looking single-line change somewhere could be the straw that breaks the linker's back..... so until we have a real solution, _every_ landing that touches libxul risks hitting this problem again.
(In reply to Jonathan Kew (:jfkthame) from comment #37)
> A harmless-looking single-line change somewhere could be the straw that
> breaks the linker's back..... so until we have a real solution, _every_
> landing that touches libxul risks hitting this problem again.

Agreed.

As such, the decision on #developers has been to leave inbound closed (since it has quality-of-service guarantees that we can't make anymore), set mozilla-central to approval required - and for now not allow *any* (Windows) libxul landings that touch c/c++, unless they are code removals.

For the latest version of the tree approval rules, see:
https://etherpad.mozilla.org/sLWdJm0zwB
(corrections/feedback appreciated)
Whiteboard: [see comment 36] → [see comment 36 & comment 38]
Bug 709480 seems to be the only option that will fix the problem without collateral damage.
Maybe some bugs in these meta bugs could help: bug 105431, bug 651016.
About RDF removal, there is also bug 474043.
Depends on: 474043
I don't think its a good idea to rely on 709480. That sounds a pretty high risk solution.

In parallel we should see if we can shrink libxul and squeeze by until we find a real fix. Lets get work going on removing unnecessary stuff from libxul.
Bug 230675 - 'decom' of nsICacheVisitor.idl: saves 10% / 150K from nkcache_s.lib
Depends on: 709721
There's also bug 405407, that is waiting for a review for months.
(In reply to Ryan VanderMeulen from comment #32)
> Also, is there a better measurement I should be using for address space?

Now that I've figured out that the correct measurement is virtual bytes, I've done another build with PerfMon going to track the virtual byte usage. At peak, link.exe had 2.93GB reserved. This is with current m-c tip with all the various backouts/deletions included. In other words, don't count on MSVC2010 to buy you much if anything in terms of solving this issue.
(In reply to Ryan VanderMeulen from comment #44)
> (In reply to Ryan VanderMeulen from comment #32)
> > Also, is there a better measurement I should be using for address space?
> 
> Now that I've figured out that the correct measurement is virtual bytes,
> I've done another build with PerfMon going to track the virtual byte usage.
> At peak, link.exe had 2.93GB reserved. This is with current m-c tip with all
> the various backouts/deletions included. In other words, don't count on
> MSVC2010 to buy you much if anything in terms of solving this issue.

Thanks for doing that.  It's great to know that MSVC 2010 alone won't help much.
I've split some of intl/ out of libxul, and we're going to try to reland SPDY on top of that and see where we are.
Assignee: nobody → khuey
Note that despite the virtual memory measurements, it's possible that Microsoft has fixed bugs in the compiler so that it won't always crash when it hits this situation. It may be that it can map/unmap files as needed, for example. That would explain successful VC 2010 builds where VC 2005 has failed.
(In reply to Ted Mielczarek [:ted, :luser] from comment #47)
> Note that despite the virtual memory measurements, it's possible that
> Microsoft has fixed bugs in the compiler so that it won't always crash when
> it hits this situation. It may be that it can map/unmap files as needed, for
> example. That would explain successful VC 2010 builds where VC 2005 has
> failed.

One way to make sure of that is to either profile VC 2010 or limit the amount of memory it can map.
It looks like most of the increase in libxul size from 10 to 11 is from Skia, which I'm told me can turn off if necessary.
So, bad news, splitting out uconv doesn't help.
(In reply to Ted Mielczarek [:ted, :luser] from comment #47)
> Note that despite the virtual memory measurements, it's possible that
> Microsoft has fixed bugs in the compiler so that it won't always crash when
> it hits this situation. It may be that it can map/unmap files as needed, for
> example. That would explain successful VC 2010 builds where VC 2005 has
> failed.

That's not correct, unfortunately. The patch in bug 609976 fails with the same error even with MSVC2010.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #49)
> It looks like most of the increase in libxul size from 10 to 11 is from
> Skia, which I'm told me can turn off if necessary.

If someone wants to throw up a disable Skia patch, I can see what difference it makes to virtual byte usage on an otherwise vanilla m-c build.
(In reply to Ryan VanderMeulen from comment #52)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #49)
> > It looks like most of the increase in libxul size from 10 to 11 is from
> > Skia, which I'm told me can turn off if necessary.
> 
> If someone wants to throw up a disable Skia patch, I can see what difference
> it makes to virtual byte usage on an otherwise vanilla m-c build.

Is this just a matter of backing out Bug 699258?
Depends on: 710473
I just landed a patch in bug 710473 which should enable us to reopen the tree.
It would be useful is somebody can do a PGO build measuring the address space of the linker with this changeset: https://hg.mozilla.org/mozilla-central/rev/221eccfa6a3f.
Whiteboard: [see comment 36 & comment 38] → [see comment 36 & comment 38 & comment 54]
(In reply to Ehsan Akhgari [:ehsan] from comment #55)
> It would be useful is somebody can do a PGO build measuring the address
> space of the linker with this changeset:
> https://hg.mozilla.org/mozilla-central/rev/221eccfa6a3f.

The try run in bug 710473 appears to be from mozilla-central and SPDY/Graphite were relanded on inbound only, so I've merged to inbound and will trigger PGO builds, to see if this works for sure :-)

https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=6c5cb98336ba
(In reply to Ehsan Akhgari [:ehsan] from comment #55)
> It would be useful is somebody can do a PGO build measuring the address
> space of the linker with this changeset:
> https://hg.mozilla.org/mozilla-central/rev/221eccfa6a3f.

My virtual bytes peak is currently 3,139,252,224 (2.92GB). I'm using a vanilla .mozconfig this time with no --enable-optimize set and nothing disabled.
Win PGO on the merge to inbound failed in the same old way.
I did the following measurements last night (with --disable-accessibility and --disable-angle, so only relative numbers are useful)

tip: 2.754 GB
no nsDerivedSafe: 2.748 GB
no Skia: 2.725 GB
aurora: 2.690 GB
Whiteboard: [see comment 36 & comment 38 & comment 54] → [see comment 36 & comment 38]
Status update, I've disabled Skia on Windows and hsivonen has landed a code removal patch so we're currently 32 MB below tip (that appears to be half a release cycle at the rate we climb :-/).

The WebGL and media separation patches are good for another 41 MB (together all of this would get us under Aurora's memory usage).
Situation is under control now, m-c is metered to clear the backlog, we're coordinating large landings until we can stand up better infrastructure.
Severity: blocker → critical
Both mozilla-central and mozilla-inbound are now fully open, no metering/sheriff approval required. 

However, please be gentle & coordinate large (read: importing brand new chunks of third party code or massive new features) C++ libxul landings with khuey, just to be safe.

Massive thanks to ehsan & mbrubeck for sheriffing mozilla-central last night, until the backlog had cleared.
Whiteboard: [see comment 36 & comment 38] → [see comment 36 & comment 62]
The backlog on the tree has been cleared, bugs are on file for everything else that needs to be done.

Many thanks to everyone who helped with this.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Depends on: 711386
This might be a silly question, but isn't it possible to use incremental linking? Or does this have other issues/the same memory issues?
Incremental linking is unrelated, it's about making subsequent links faster when you make changes.
Ftr, the following changeset is actually part of bug 709657:
http://hg.mozilla.org/comm-central/rev/f72675b4a48f
Revert changes from Bug 709193.
No longer blocks: SPDY, 631479, 706236, 707173, 707662, 708305
Blocks: 750661
No longer depends on: 711386
Component: Build Config → General
Product: Firefox → Firefox Build System
You need to log in before you can comment on or make changes to this bug.